feat(e2e-harness): drive and snapshot the real wizard TUI #702
Merged
gewenyu99
commented
Jun 22, 2026
gewenyu99 added a commit that referenced this pull request Jun 22, 2026
…ord/replay

A control plane over the TUI store that drives the wizard end-to-end with no terminal and no browser, for CI/e2e and agent-driven testing. The render is a pure function of the nanostore, so driving committed state == driving the UI.

Core files (src/lib/ci-driver/):
- wizard-ci-driver.ts: read_state / list_actions / perform_action over a live WizardStore. read_state is a truthful, secret-free projection of committed state (+ derived currentScreen); perform_action commits via the exact store setter the Ink screen's key handler calls.
- action-registry.ts: declarative screen -> commit-action map (exhaustive over ScreenId/Overlay). The actuation surface: name an action, not a keystroke.
- wizard-ci-tools.ts: in-process MCP server exposing the three tools, so an external harness or LLM can drive a real run.
- e2e-profile.ts: WizardE2eProfile, a program's declarative e2e test definition (the UI choices). decideE2eAction(state, profile) maps screen -> commit, so the harness is generic and the choices live on the program.
- recorder.ts: captures a frame at each key moment (route/task/status/runPhase/overlay change) off the store's version counter; redacts the access token.
- replay.ts: reconstructs a throwaway store per frame and renders the REAL Ink screen back to ANSI, so a run replays in the terminal.
- DRIVING-E2E-FROM-AN-AGENT.md: how a future agent drives these.
- __tests__/: control-plane walk, flow snapshot (TUI-snapshot analog), recorder.

Programs declare their flow's UI choices:
- programs/program-step.ts: ProgramConfig.e2e?: WizardE2eProfile.
- programs/posthog-integration/index.ts: the integration program's e2e profile.

Harness/entry scripts:
- scripts/e2e-full-run.no-jest.ts: headless full run with a real WizardStore + InkUI (never rendered) + concurrent driver + real runAgent; emits a structured result + a recording.
- scripts/replay-e2e.no-jest.ts: replay a recording in the terminal.
- scripts/ci-driver-demo.ts: offline control-plane demo (no agent).

Additive; no core wizard behavior changed. The workbench `wizard-ci --e2e` (PostHog/wizard-workbench) orchestrates these against real test apps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
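A minimal sketch of the read_state / list_actions / perform_action idea described in this commit, assuming a drastically simplified store. `DemoStore`, `DemoDriver`, `currentScreen`, the three-screen `ScreenId` union, and the `choose_option` action are hypothetical stand-ins, not the real WizardStore or registry; only `completeSetup` and `confirm_setup` are names taken from this PR.

```typescript
// Hypothetical, simplified stand-ins: the real WizardStore and ScreenId are richer.
type ScreenId = 'intro' | 'setup' | 'auth';

class DemoStore {
  introConfirmed = false;
  setupChoice: string | null = null;
  accessToken: string | null = 'phx_demo_token'; // secret: must never appear in read_state

  completeSetup(): void {
    this.introConfirmed = true;
  }
  chooseSetup(option: string): void {
    this.setupChoice = option;
  }
}

// The current screen is derived purely from committed state.
function currentScreen(store: DemoStore): ScreenId {
  if (!store.introConfirmed) return 'intro';
  if (store.setupChoice === null) return 'setup';
  return 'auth';
}

type Action = (store: DemoStore, arg?: string) => void;

// Declarative screen -> action map. Record<ScreenId, ...> forces an entry per screen.
const ACTION_REGISTRY: Record<ScreenId, Record<string, Action | undefined>> = {
  intro: { confirm_setup: (store) => store.completeSetup() },
  setup: { choose_option: (store, arg) => store.chooseSetup(arg ?? 'first') },
  auth: {}, // nothing to commit on this screen in the sketch
};

interface DemoState {
  currentScreen: ScreenId;
  introConfirmed: boolean;
  setupChoice: string | null;
  authenticated: boolean;
}

class DemoDriver {
  private readonly store: DemoStore;

  constructor(store: DemoStore) {
    this.store = store;
  }

  // Secret-free projection: reports whether a token exists, never the token itself.
  readState(): DemoState {
    return {
      currentScreen: currentScreen(this.store),
      introConfirmed: this.store.introConfirmed,
      setupChoice: this.store.setupChoice,
      authenticated: this.store.accessToken !== null,
    };
  }

  listActions(): string[] {
    return Object.keys(ACTION_REGISTRY[currentScreen(this.store)]);
  }

  // One committed mutation per call, made through the store's own setter.
  performAction(name: string, arg?: string): DemoState {
    const action = ACTION_REGISTRY[currentScreen(this.store)][name];
    if (action === undefined) {
      throw new Error(`"${name}" is not a legal action on this screen`);
    }
    action(this.store, arg);
    return this.readState();
  }
}
```

Because the screen is derived from state, the caller names an action rather than a keystroke, and an action that is not registered for the current screen is rejected instead of silently ignored.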
The e2e UI-choices object moves out of index.ts into a co-located e2e.ts (POSTHOG_INTEGRATION_E2E_PROFILE), keeping the program config lean and the flow's test definition in its own file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
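To make the profile idea concrete, here is a rough sketch of how a declarative profile could map a screen to a commit action, in the spirit of the decideE2eAction(state, profile) function mentioned earlier in this PR. The `E2eProfile` fields, the `Screen` union, and every action name other than `confirm_setup` are assumptions made for illustration; the real WizardE2eProfile shape is not shown in this conversation.

```typescript
// Hypothetical screen ids and profile fields, for illustration only.
type Screen = 'intro' | 'outage' | 'setup' | 'mcp' | 'slack' | 'run';

interface E2eProfile {
  setupOption: 'first' | 'last';
  mcp: 'skip' | 'install';
  slack: 'skip' | 'connect';
}

interface Decision {
  action: string;
  arg?: string;
}

// Pure function: the same state and profile always yield the same decision.
// Returns null on screens where the harness has nothing to commit.
function decideE2eAction(
  state: { currentScreen: Screen },
  profile: E2eProfile,
): Decision | null {
  switch (state.currentScreen) {
    case 'intro':
      return { action: 'confirm_setup' };
    case 'outage':
      return { action: 'dismiss_outage' };
    case 'setup':
      return { action: 'choose_option', arg: profile.setupOption };
    case 'mcp':
      return { action: profile.mcp === 'skip' ? 'skip_mcp' : 'install_mcp' };
    case 'slack':
      return { action: profile.slack === 'skip' ? 'skip_slack' : 'connect_slack' };
    case 'run':
      return null; // the agent run advances this screen, not the driver
  }
}
```

Keeping the choices in a plain data object means the harness loop stays generic: it only reads state, asks for a decision, and performs it.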
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/record-demo.no-jest.ts — produces a recording offline (no agent, no network) by driving the integration flow with the e2e profile + a WizardRecorder, so `replay-e2e.no-jest.ts` can be tried without a full run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/README.md documents the manual control-plane + record/replay tools (what each does, what it needs, how to run). Also commits ci-driver-live-agent.ts (real gateway LLM drives the wizard-ci-tools MCP server) so the index is complete. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
main added two confirm-and-continue intro screens (WarehouseIntro, SelfDrivingIntro, both call store.completeSetup()). The action-registry exhaustiveness test flagged them as uncovered. Register both as confirm_setup in ACTION_REGISTRY and in the e2e walk policy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l refs Move DRIVING-E2E-FROM-AN-AGENT.md → ARCHITECTURE.md to match the co-located subsystem-doc convention (cf. programs/self-driving/ARCHITECTURE.md). Remove content that shouldn't ship in the public repo: the internal test project id + team name, the workbench test-api-key.txt secret file, and pointers to workbench-only scratch files. Keep the architecture, profiles, record/replay, and MCP-loop guidance; generalize the run instructions. Update the scripts/README link. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/render-snapshots.no-jest.ts renders every key-moment frame of a recording to a real-Ink ANSI snapshot (one <seq>-<screen>.ans per frame), via replay's renderFrame under tsx. These feed the workbench visual-regression flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
None of the control-plane / recording / e2e machinery belongs in the wizard's production source. Relocate src/lib/ci-driver/ → e2e-harness/ at the repo root (next to e2e-tests/), and sever every prod coupling:
- Remove the ProgramConfig.e2e field (program-step.ts) and the on-program profile (delete posthog-integration/e2e.ts, unwire index.ts). Per-program profiles now live in the harness: e2e-harness/profiles.ts, profileFor(programId).
- Add an @e2e-harness/* path alias (tsconfig.build.json + jest moduleNameMapper); repoint scripts/tests off @lib/ci-driver.

Result: src/ has ZERO references to the harness, and the published tsdown bundle contains none of it (previously the ~90-byte profile object shipped). Full suite (1045 tests, 3 snapshots) passes; real-recording render verified under tsx.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ARCHITECTURE.md now documents the wizard-ci-snapshots visual-regression flow (real run → render → diff → side-by-side report) and the env it needs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…gram A test/ README documents this program's e2e test definition — the path the headless run walks and the option it auto-takes at each screen (confirm intro, dismiss outage, first setup option, skip mcp/slack, delete skills). It's the human description; the runnable profile stays in e2e-harness/profiles.ts. No e2e machinery returns to prod src — this is documentation only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…oads Each program declares its e2e test path as src/lib/programs/<program>/test/e2e.json — a `profile` (the options the headless run auto-takes) plus a documented `path` of every screen. The harness imports the `profile` in e2e-harness/profiles.ts (single source of truth, no prose duplication). Matches the repo's existing JSON-data pattern (mcp-role-prompts.copy.json); resolveJsonModule already on. It's data, imported only by the harness — zero prod imports, absent from the tsdown bundle. Full harness suite + runtime load verified. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the end-to-end trace (agent → perform_action → driver → action-registry → store.completeSetup → emitChange → router re-resolve → readState) as a comment at the perform_action tool, with cross-referenced breadcrumbs at the driver hop (one committed mutation per call) and the action-registry hop (the store setter + flag-flip the screen sequence reacts to). Harness-only; prod store.ts untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…dule Add a header note to wizard-ci-tools / wizard-ci-driver / action-registry / recorder / replay: each lives in e2e-harness/, is imported only by scripts/tests, and is absent from the tsdown bundle (bin.ts is the only entry). Addresses the "this looks shippable" worry right where a reader meets the code (esp. the MCP server + SDK import). Verified: no e2e symbols in dist/. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Moving the trace / never-ships / credentials notes to PR review comments anchored to the lines instead — keep the source uncluttered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-by-turn scripts/wizard-ci-mcp.no-jest.ts is a stdio MCP server over one live WizardStore: read_state / list_actions / perform_action / render_screen / run_agent. An agent registers it and makes every decision live, instead of the static scripted run. Rewrite the exploring-the-wizard skill to lead with this. Bump zod ^3.24→^3.25 (the MCP SDK needs the zod/v3 subpath; non-breaking) and add the SDK as a dep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
read_state already returns the legal actions, so the separate tool is noise. Keeps the server's surface minimal: read_state, perform_action, render_screen, run_agent. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…hange Running prettier on these (not in lint-staged) reflowed the whole files — pure diff noise. Restore them to main and re-apply just the intended edits: the "Explore with an agent" section + the exploring-the-wizard skill row.
…d runbook EXPLORING-AS-AN-AGENT.md was promoted to .claude/skills/exploring-the-wizard/; this pointer fix was left uncommitted, so HEAD still linked the deleted file.
…ion start The skill told agents to `claude mcp add` then immediately call the tools, which is impossible (MCP servers load at session start), so agents fell back to a script. Lead with the in-session way that actually works — a WizardCiDriver script (read_state → perform_action → renderFrame), tested — and document the MCP server as the interactive option that needs registering before a fresh session.
…with it Connect the stdio transport first and build the store lazily on the first tool call — detection + the networked health probe used to run before connect(), which could stall the MCP handshake so Claude Code saw the server as broken. Verified end-to-end: `claude mcp add` → `claude mcp list` shows ✔ Connected → a headless session drove read_state → perform_action(confirm_setup) → auth → render_screen. Skill now leads with the two-phase MCP flow (register, then drive in a fresh session, since MCP tools bind at session start); the driver script is the fallback.
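The connect-first, build-lazily ordering can be sketched without the MCP SDK at all. In the sketch below, `once`, `createDemoServer`, and the `{ currentScreen }` store shape are hypothetical; the point is only that `connect()` does no expensive work and the store is built exactly once, on the first tool call.

```typescript
// Memoize an async builder: the first call starts the build, later calls share it.
// Note: a rejected build stays cached in this sketch; a real server might retry.
function once<T>(build: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => (cached ??= build());
}

interface DemoStoreView {
  currentScreen: string;
}

// Hypothetical server shape; the real one speaks MCP over stdio.
function createDemoServer(buildStore: () => Promise<DemoStoreView>) {
  const getStore = once(buildStore);
  return {
    connected: false,
    // The handshake stays cheap: no detection, no network probe.
    connect(): void {
      this.connected = true;
    },
    // Slow setup (detection, health probe) happens here, on first use.
    async handleToolCall(name: string): Promise<string> {
      const store = await getStore();
      return `${name}:${store.currentScreen}`;
    },
  };
}
```

With this ordering, a slow or stalled store build can only delay a tool call, never the handshake that the client uses to decide whether the server is healthy.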
…drives in one session Register wizard-ci in .mcp.json so its tools are bound in every session in this repo. An agent following the exploring-the-wizard skill now drives the wizard over MCP (open_app -> read_state -> perform_action -> render_screen -> run_agent) without registering anything or starting a fresh session. The server boots app-agnostic; open_app picks the app + key at call time, so the committed config holds no secrets. Skill + README rewritten to the one-session MCP flow. Verified: a fresh headless agent given only the skill drove the wizard with four MCP calls and wrote zero scripts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Just say to point appDir at the directory that has the package.json. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
appDir is just the throwaway copy of the app; let the agent find the path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
auth (and run) are NO_ACTION screens: session.credentials is set only inside bootstrapProgram, which runs via run_agent. So nothing advances past auth without run_agent — but the tool description said "call when currentScreen=run" and the skill walk skipped auth, so an agent landed on auth and polled instead of calling run_agent. Fix the run_agent description and the skill walk/key-facts to say run_agent bootstraps creds and advances auth+run; don't poll those screens. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ves the run
A real run_agent call blocked the stdio MCP server for ~3 minutes; the client
treated the server as unhealthy, reconnected, and the restarted process lost its
in-memory store ("No app open", runPhase reset to idle). run_agent now starts the
integration in the background and returns immediately; read_state stays responsive
and reports runPhase running -> completed plus an integration status, so the agent
polls instead of blocking. Skill + tool descriptions updated to the poll model;
noted that run_agent creates real PostHog resources each run.
Proven: run_agent returns in 0.0s; read_state during the run answers in 1-2ms with
runPhase=running.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
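The start-in-background-then-poll model from this commit could look roughly like the sketch below. `RunController`, `lastError`, and `pollUntilDone` are hypothetical names, and the `'failed'` phase is an assumption based on the failure reporting mentioned elsewhere in this PR; the real tool lives behind MCP and tracks more state.

```typescript
type RunPhase = 'idle' | 'running' | 'completed' | 'failed';

class RunController {
  private runPhase: RunPhase = 'idle';
  private lastError: string | null = null;

  // Kick off the long-running work without awaiting it, then return at once.
  runAgent(work: () => Promise<void>): { started: boolean } {
    if (this.runPhase === 'running') return { started: false };
    this.runPhase = 'running';
    this.lastError = null;
    void Promise.resolve()
      .then(work)
      .then(
        () => {
          this.runPhase = 'completed';
        },
        (err: unknown) => {
          this.runPhase = 'failed';
          this.lastError = err instanceof Error ? err.message : String(err);
        },
      );
    return { started: true };
  }

  // Always cheap and synchronous, so it stays responsive during the run.
  readState(): { runPhase: RunPhase; lastError: string | null } {
    return { runPhase: this.runPhase, lastError: this.lastError };
  }
}

// Client side: poll read_state instead of blocking on run_agent.
async function pollUntilDone(
  controller: RunController,
  intervalMs = 50,
): Promise<{ runPhase: RunPhase; lastError: string | null }> {
  while (controller.readState().runPhase === 'running') {
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return controller.readState();
}
```

The design choice is that no single request ever outlives the client's health timeout: the run may take minutes, but each call returns in bounded time.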
…or both routes

Both e2e routes run the real wizard TUI (startTUI) driven by store state manipulation, with no keystrokes, and capture the real rendered screen from a PTY. Auth is satisfied by setCredentials with the phx key (same bearer as an OAuth token), so the TUI advances with no browser.
- e2e-harness/tui-capture.ts: run a command in a PTY (node-pty), read its screen via @xterm/headless.
- scripts/tui-host.no-jest.ts: the real-TUI host. MODE=fixed self-drives the fixed e2e profile, signals each screen, writes a structured result JSON; MODE=serve takes drive commands over a unix socket.
- scripts/tui-snapshots.no-jest.ts: CI route, a real-TUI text snapshot per screen.
- scripts/wizard-ci-mcp.no-jest.ts: agent route, an MCP server proxying the host.
- scripts/wizard-ci-explore.no-jest.ts: drive the MCP route, print the real TUI.
- scripts/tui-replay.no-jest.ts: replay captured snapshots in the terminal.

Deletes the record-then-reconstruct machinery (recorder, replay, e2e-full-run, render-snapshots, replay-e2e) and the in-process wizard-ci-tools server. Adds node-pty + @xterm/headless.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sition Snapshot on key moments — a screen change, a task-list update, or a runPhase change — via a store subscription, and snap each screen before the driver acts on it. The run screen (the agent working) is captured as it progresses, and fast transitions (intro/auth/outro/mcp/slack) are no longer skipped by throttling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
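One way to picture the key-moment rule (snapshot on a screen change, a task-list update, or a runPhase change, and on nothing else) is a subscriber that compares a small fingerprint of the state. `StoreState`, `KeyMomentSnapshotter`, and the `statusLine` field are hypothetical; the real host subscribes to the wizard store and captures the PTY screen rather than calling a render callback.

```typescript
interface StoreState {
  currentScreen: string;
  tasks: readonly string[];
  runPhase: string;
  statusLine: string; // changes often; deliberately NOT part of a key moment
}

// Only the fields that define a key moment go into the fingerprint.
function keyMomentKey(state: StoreState): string {
  return JSON.stringify([state.currentScreen, state.tasks, state.runPhase]);
}

class KeyMomentSnapshotter {
  readonly frames: string[] = [];
  private lastKey: string | null = null;
  private readonly render: (state: StoreState) => string;

  constructor(render: (state: StoreState) => string) {
    this.render = render;
  }

  // Call from the store subscription; returns true when a frame was captured.
  onStoreChange(state: StoreState): boolean {
    const key = keyMomentKey(state);
    if (key === this.lastKey) return false;
    this.lastKey = key;
    this.frames.push(this.render(state));
    return true;
  }
}
```

Because capture is triggered by the change itself rather than by a timer, a fast transition still produces a frame, while noisy updates that leave the fingerprint unchanged produce none.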
…ed loop Snapshot on every key-moment change (no throttle spacing, just a settle). And don't await the driver loop at exit — on the cheap (no-agent) path it's parked in waitForChange, so awaiting it hung the process and exited non-zero, which would fail CI. The process now exits 0 cleanly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The fixed CI route always drives the full real agent run — a no-agent path was pointless (and is what hung at exit). Removes the RUN_AGENT branch and the auth-by-state shortcut it needed in fixed mode; auth is bootstrapped by the run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
node-pty ships no linux-x64 prebuilt, so CI must compile it; pnpm 10 blocks build scripts unless allowlisted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ink renders non-interactively when it detects CI (CI / GITHUB_ACTIONS), leaving the captured xterm buffer blank. Strip them from the spawned host's env. Verified locally: with CI=true, render_screen now returns the real TUI instead of blank. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
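Stripping the CI markers from the child's environment can be sketched as a small pure function. `hostEnv` and its default key list are hypothetical helpers written for this illustration; only the two variable names, CI and GITHUB_ACTIONS, come from the commit message.

```typescript
// Build the environment for the spawned TUI host from the parent's environment,
// dropping the variables that make ink fall back to non-interactive rendering.
function hostEnv(
  parentEnv: Record<string, string | undefined>,
  strip: readonly string[] = ['CI', 'GITHUB_ACTIONS'],
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [key, value] of Object.entries(parentEnv)) {
    if (value === undefined) continue; // unset in the parent: leave unset
    if (strip.includes(key)) continue; // CI marker: hide from the child
    env[key] = value;
  }
  return env;
}
```

A caller would pass something like `hostEnv(process.env)` as the `env` option when spawning the host, so the parent process still sees CI=true while the child renders as if attached to an interactive terminal.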
426da5a to c506fea
main added the source-maps detection screen; the action-registry exhaustiveness test requires every screen be actionable or explicitly no-action. The integration e2e profile never enters the source-maps program, so it joins the other non-integration screens in NO_ACTION_SCREENS, with a note to wire it in when a source-maps profile drives that program. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
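The exhaustiveness rule described here (every screen is either actionable or explicitly no-action) reduces to a set difference. The sketch below is an assumption about how such a check could be written; `uncoveredScreens` and the example screen ids in the usage note are hypothetical, and the real test works over the actual ScreenId/Overlay types.

```typescript
// Returns every screen that is neither registered with actions nor explicitly
// listed as no-action. An empty result means the registry is exhaustive.
function uncoveredScreens(
  allScreens: readonly string[],
  actionable: ReadonlyMap<string, readonly string[]>,
  noAction: ReadonlySet<string>,
): string[] {
  return allScreens.filter(
    (screen) => !actionable.has(screen) && !noAction.has(screen),
  );
}
```

When main adds a screen, it shows up in the returned list until someone either registers an action for it or adds it to the no-action set, which is the decision this commit records for the source-maps screen.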
postbuild copies scripts/ into dist (which ships); drop the *.no-jest.* e2e/CI scripts from dist so the published wizard carries only runtime scripts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
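The filtering rule amounts to a filename predicate. This is only a sketch of the rule: `shippableScripts` is a hypothetical helper, and the actual postbuild step may be a shell command or build-tool option rather than TypeScript.

```typescript
// Matches the *.no-jest.* naming convention used by the e2e/CI scripts.
const NO_JEST_PATTERN = /\.no-jest\./;

// Keep only the scripts that should be published with the wizard.
function shippableScripts(files: readonly string[]): string[] {
  return files.filter((file) => !NO_JEST_PATTERN.test(file));
}
```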
- drop a stray blank line from posthog-integration config (no prod diff)
- extract the shared intro/health-check/run sequence in tui-host
- pass projectId to getOrAskForProjectData as a number (its declared type)
- strip host AI_AGENT alongside CLAUDE/ANTHROPIC, matching the workbench

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
edwinyjlim approved these changes Jun 24, 2026
edwinyjlim (Member) left a comment:
good since it's all additive
- never write an inline api key to disk; pass it to the host via env (POSTHOG_PERSONAL_API_KEY), same as the CI path. A caller-supplied keyFile is still used as-is.
- surface a failed run's error in read_state (integrationError) so CI and the agent see why the integration failed instead of a bare 'failed'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
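A sketch of the first point, assuming the host launch can be planned as plain data. `OpenAppOptions`, `HostLaunch`, and `planHostLaunch` are hypothetical, and giving keyFile precedence when both are supplied is an assumption of this sketch; only the POSTHOG_PERSONAL_API_KEY variable name and the "keyFile is used as-is" rule come from the commit message.

```typescript
interface OpenAppOptions {
  appDir: string;
  apiKey?: string; // inline key: must never be written to disk
  keyFile?: string; // caller-owned file: used as-is
}

interface HostLaunch {
  env: Record<string, string>;
  keyFile: string | null;
}

// Decide how the key reaches the host, without touching the filesystem.
function planHostLaunch(
  options: OpenAppOptions,
  baseEnv: Record<string, string> = {},
): HostLaunch {
  if (options.keyFile !== undefined) {
    // Assumption: an explicit keyFile wins over an inline key.
    return { env: { ...baseEnv }, keyFile: options.keyFile };
  }
  if (options.apiKey !== undefined) {
    return {
      env: { ...baseEnv, POSTHOG_PERSONAL_API_KEY: options.apiKey },
      keyFile: null,
    };
  }
  throw new Error('open_app needs either apiKey or keyFile');
}
```

Passing the key through the child's environment keeps it out of temp files and out of the committed config, which matches how the CI path is described as supplying it.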
Spell out the explore walk (open_app, snapshot each key moment, act, run_agent, finish) and have it save numbered render_screen frames to /tmp/wz-explore-snaps, matching the CI route's .txt frames. Align the skill's snapshot guidance with the README example. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sarahxsanders added a commit that referenced this pull request Jun 25, 2026
Resolved 5 conflicts from main's #702/#725/#726:
- runner/index.ts: combined our idempotent flushScanReport finalizer (registerCleanup + finally, return await) with main's stampVariant() calls in both fork arms
- constants.ts: kept WIZARD_WARLOCK_DISABLED_FLAG_KEY; took main's removal of WIZARD_VARIANTS (variant is now runner-derived via stampVariant)
- package.json: kept both new deps (@vitest/coverage-v8 + @xterm/headless); dropped main's re-added root jest config block (root is vitest now; e2e-tests keeps its own jest config)
- tsconfig.json: added main's e2e-harness to include; kept our e2e-tests exclusion (standalone jest package, not in the vitest root typecheck)
- pnpm-lock.yaml: regenerated via pnpm install

Canonicalized main's new e2e-harness snapshots to vitest key format (content unchanged; jest used "describe test", vitest uses "describe > test"). Full suite green: 987 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
How to test

Agent route: drive the wizard yourself. In a fresh session in this repo, run the `exploring-the-wizard` skill. `wizard-ci` is registered in `.mcp.json`, so the tools are already bound: `open_app` boots the real TUI on an app, then `read_state` / `perform_action` / `render_screen` (which returns the real rendered screen).

CI snapshots: real-TUI visual regression. Run from a `wizard-workbench` checkout next to this repo (PostHog creds in its `.env`). The run drives the full real agent flow against express-todo through the real TUI, captures each key moment, diffs against the committed baseline, and writes `report.html`. Or comment `/wizard-ci` on a PR: same run, posted back as a comment. (Pairs with PostHog/wizard-workbench#2012.)

What this is

A headless e2e control plane that drives the real wizard TUI and captures what it renders. Both routes share one primitive:
- The host (`scripts/tui-host.no-jest.ts`) runs the real `startTUI` and drives its store by state manipulation, with no keystrokes. Auth uses the phx key (same bearer as an OAuth token), so the TUI advances with no browser.
- The capture layer (`e2e-harness/tui-capture.ts`) runs the host in a PTY (`node-pty`) and reads the real rendered screen via `@xterm/headless`.

Routes:
- CI snapshots (`tui-snapshots`): the fixed e2e profile self-drives the host through the real agent run → one real-TUI text snapshot per key moment (including the run screen's progression), diffed against a committed baseline.
- Agent route (`wizard-ci-mcp`): an MCP server proxies the host so an agent decides each screen; `render_screen` returns the real frame. The `exploring-the-wizard` skill is the how-to.

None of it ships: it lives in `e2e-harness/` + `scripts/`, out of `src/`.